A Neural Network for Feature Extraction


Nathan  Intrator 
Div.  of  Applied  Mathematics,  and 
Center  for  Neural  Science 
Brown  University 
Providence,  RI  02912 


ABSTRACT 

The paper suggests a statistical framework for the parameter estimation
problem associated with unsupervised learning in a neural
network, leading to an exploratory projection pursuit network that
performs feature extraction, or dimensionality reduction.

1  INTRODUCTION 

The search for a possible presence of some unspecified structure in a high dimensional
space can be difficult due to the curse of dimensionality problem, namely
the  inherent  sparsity  of  high  dimensional  spaces.  Due  to  this  problem,  uniformly 
accurate  estimations  for  all  smooth  functions  are  not  possible  in  high  dimensions 
with  practical  sample  sizes  (Cox,  1984,  Barron,  1988). 

Recently,  exploratory  projection  pursuit  (PP)  has  been  considered  (Jones,  1983)  as  a 
potential  method  for  overcoming  the  curse  of  dimensionality  problem  (Huber,  1985), 
and  new  algorithms  were  suggested  by  Friedman  (1987),  and  by  Hall  (1988,  1989). 
The  idea  is  to  find  low  dimensional  projections  that  provide  the  most  revealing 
views  of  the  full-dimensional  data  emphasizing  the  discovery  of  nonlinear  effects 
such  as  clustering. 

Many  of  the  methods  of  classical  multivariate  analysis  turn  out  to  be  special  cases 
of  PP  methods.  Examples  are  principal  component  analysis,  factor  analysis,  and 
discriminant analysis. The various PP methods differ by the projection index optimized.


Neural  networks  seem  promising  for  feature  extraction,  or  dimensionality  reduction, 
mainly  because  of  their  powerful  parallel  computation.  Feature  detecting  functions 
of  neurons  have  been  studied  in  the  past  two  decades  (von  der  Malsburg,  1973,  Nass 
et  al.,  1973,  Cooper  et  al.,  1979,  Takeuchi  and  Amari,  1979).  It  has  also  been  shown 
that  a  simplified  neuron  model  can  serve  as  a  principal  component  analyzer  (Oja, 
1982). 

This  paper  suggests  a  statistical  framework  for  the  parameter  estimation  problem 
associated  with  unsupervised  learning  in  a  neural  network,  leading  to  an  exploratory 
PP  network  that  performs  feature  extraction,  or  dimensionality  reduction,  of  the 
training  data  set.  The  formulation,  which  is  similar  in  nature  to  PP,  is  based  on 
a  minimization  of  a  cost  function  over  a  set  of  parameters,  yielding  an  optimal 
decision rule under some norm. First, the formulations of single and multiple
feature extraction are presented. Then a new projection index (cost function) is
presented that favors directions possessing multimodality, where the multimodality
is measured in terms of the separability property of the data. This leads to the
synaptic modification equations governing learning in Bienenstock, Cooper, and
Munro (BCM) neurons (1982). A network is presented based on the multiple feature
extraction formulation, and both the linear and the nonlinear neurons are analysed.


2  SINGLE  FEATURE  EXTRACTION 


We  associate  a  feature  with  each  projection  direction.  With  the  addition  of  a 
threshold function we can say that an input possesses a feature associated with that
direction  if  its  projection  onto  that  direction  is  larger  than  the  threshold.  In  these 
terms,  a  one  dimensional  projection  would  be  a  single  feature  extraction. 

The  approach  proceeds  as  follows:  Given  a  compact  set  of  parameters,  define  a 
family  of  loss  functions,  where  the  loss  function  corresponds  to  a  decision  made  by 
the  neuron  whether  to  fire  or  not  for  a  given  input.  Let  the  risk  be  the  averaged 
loss  over  all  inputs.  Minimize  the  risk  over  all  possible  decision  rules,  and  then 
minimize  the  risk  over  the  parameter  set.  In  case  the  risk  does  not  yield  a  meaningful 
minimization  problem,  or  when  the  parameter  set  over  which  the  minimization  takes 
place  can  be  restricted  by  some  a-priori  knowledge,  a  penalty,  i.e.  a  measure  on  the 
parameter  set,  may  be  added  to  the  risk. 

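Define the decision problem $(\Omega, \mathcal{F}_n, P, L, \mathcal{A})$, where $\Omega = (x^{(1)}, \ldots, x^{(n)})$, $x^{(i)} \in R^N$,
is a fixed set of input vectors, $(\Omega, \mathcal{F}_n, P)$ the corresponding probability space, $\mathcal{A} = \{0, 1\}$
the decision space, and $L_\theta : \Omega \times \mathcal{A} \mapsto R$, $\theta \in B^M$, is the family of loss
functions. $B^M$ is a compact set in $R^M$. Let $\mathcal{D}$ be the space of all decision rules.
The risk $R_\theta : \mathcal{D} \mapsto R$ is given by:

$$R_\theta(\delta) = \sum_{i=1}^{n} P(x^{(i)}) L_\theta(x^{(i)}, \delta(x^{(i)})). \qquad (2.1)$$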

For a fixed $\theta$, the optimal decision $\delta_\theta$ is chosen so that:

$$R_\theta(\delta_\theta) = \min_{\delta \in \mathcal{D}} \{R_\theta(\delta)\}. \qquad (2.2)$$

Since the minimization takes place over a finite set, the minimizer exists. In particular,
for a given $x^{(i)}$ the decision $\delta_\theta(x^{(i)})$ is chosen so that
$L_\theta(x^{(i)}, \delta_\theta(x^{(i)})) < L_\theta(x^{(i)}, 1 - \delta_\theta(x^{(i)}))$.

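Now we find an optimal $\hat\theta$ that minimizes the risk, namely, $\hat\theta$ will be such that:

$$R_{\hat\theta}(\delta_{\hat\theta}) = \min_{\theta \in B^M} \{R_\theta(\delta_\theta)\}. \qquad (2.3)$$

The minimum with respect to $\theta$ exists since $B^M$ is compact.

$R_\theta(\delta_\theta)$ becomes a function that depends only on $\theta$, and when $\theta$ represents a vector
in $R^N$, $R_\theta$ can be viewed as a projection index.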
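
The two-stage minimization of (2.2) and (2.3) can be sketched for a finite input set. This is only an illustration under stated assumptions: the inputs are scalars, the compact parameter set is replaced by a small grid, $P$ is uniform, and `toy_loss` is a hypothetical loss chosen for readability (it is not the loss of Section 4).

```python
from typing import Callable, Sequence, Tuple

# loss(theta, x, a): cost of decision a in {0, 1} for input x under parameter theta
Loss = Callable[[float, float, int], float]


def optimal_decision(loss: Loss, theta: float, x: float) -> int:
    """Pointwise minimizer over the decision space {0, 1}, as in (2.2)."""
    return 0 if loss(theta, x, 0) <= loss(theta, x, 1) else 1


def reduced_risk(loss: Loss, theta: float,
                 xs: Sequence[float], probs: Sequence[float]) -> float:
    """Risk (2.1) evaluated at the optimal decision rule for a fixed theta."""
    return sum(p * loss(theta, x, optimal_decision(loss, theta, x))
               for x, p in zip(xs, probs))


def best_parameter(loss: Loss, thetas: Sequence[float],
                   xs: Sequence[float], probs: Sequence[float]) -> Tuple[float, float]:
    """Minimize the reduced risk over a finite grid standing in for B^M, as in (2.3)."""
    best = min(thetas, key=lambda t: reduced_risk(loss, t, xs, probs))
    return best, reduced_risk(loss, best, xs, probs)


def toy_loss(theta: float, x: float, a: int) -> float:
    """Hypothetical loss: firing (a = 1) codes x as theta, not firing codes it as 0."""
    return (x - theta) ** 2 if a == 1 else x ** 2


xs = [0.0, 0.1, 1.9, 2.0]            # fixed input set
probs = [0.25, 0.25, 0.25, 0.25]     # assumed uniform P
theta_hat, risk_hat = best_parameter(toy_loss, [0.5, 1.0, 1.5, 2.0, 2.5], xs, probs)
```

Because the risk is a sum of per-input terms, choosing the cheaper decision for each input separately minimizes the risk over all decision rules, which is why `optimal_decision` works pointwise.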

3  MULTI-DIMENSIONAL  FEATURE  EXTRACTION 

In this case we have a single layer network of interconnected units, each performing
a single feature extraction. All units receive the same input and the interaction between
the units is via lateral inhibition. The formulation is similar to single feature
extraction, with the addition of interaction between the single feature extractors.
Let $Q$ be the number of features to be extracted from the data. The multiple decision
rule $\delta_\theta = (\delta_\theta^{(1)}, \ldots, \delta_\theta^{(Q)})$ takes values in $\mathcal{A} = \{0, 1\}^Q$. The risk of node $k$
is given by: $R_\theta^{(k)}(\delta) = \sum_{i=1}^{n} P(x^{(i)}) L_\theta^{(k)}(x^{(i)}, \delta^{(k)}(x^{(i)}))$, and the total risk of the
network is $R_\theta(\delta) = \sum_{k=1}^{Q} R_\theta^{(k)}(\delta)$. Proceeding as before, we can minimize over the
decision rules $\delta$ to get $\delta_\theta$, and then minimize over $\theta$ to get $\hat\theta$, as in equation (2.3).

The coupling of the equations via the inhibition and the relation between the
different features extracted are exhibited in the loss function for each node, and will
become clear through the next example.

4  FINDING THE OPTIMAL $\theta$ FOR A SPECIFIC LOSS FUNCTION

4.1  A  SINGLE  BCM  NEURON  -  ONE  FEATURE  EXTRACTION 

In this section, we present an exploratory PP method with a specific loss function.
The differential equations performing the optimization turn out to be a good approximation
of the law governing synaptic weight modification in the BCM theory
for learning and memory in neurons. The formal presentation of the theory and
some theoretical analysis are given in (Bienenstock, 1980, Bienenstock et al., 1982);
mean field theory for a network based on these neurons is presented in (Scofield
and Cooper, 1985, Cooper and Scofield, 1988); more recent analysis based on the
statistical viewpoint is in (Intrator, 1990); computer simulations and the biological
relevance are discussed in (Saul et al., 1986, Bear et al., 1987, Cooper et al., 1988).

We  start  with  a  short  review  of  the  notations  and  definitions  of  BCM  theory. 
Consider a neuron with input vector $x = (x_1, \ldots, x_N)$, synaptic weights vector
$m = (m_1, \ldots, m_N)$, both in $R^N$, and activity (in the linear region) $c = x \cdot m$.


Define $\Theta_m = E[(x \cdot m)^2]$, $\phi(c, \Theta_m) = c^2 - \frac{4}{3} c \Theta_m$, $\hat\phi(c, \Theta_m) = c^2 - \frac{2}{3} c \Theta_m$. The input
$x$, which is a stochastic process, is assumed to be of Type II $\varphi$ mixing, bounded, and
piecewise constant. The $\varphi$ mixing property specifies the dependency of the future
of the process on its past. These assumptions are needed for the approximation of
the resulting deterministic equation by a stochastic one and are discussed in detail
in (Intrator, 1990). Note that $c$ represents the linear projection of $x$ onto $m$, and
we seek an optimal projection in some sense.

The BCM synaptic modification equations are given by: $\dot{m} = \mu(t)\phi(x \cdot m, \Theta_m)x$,
$m(0) = m_0$, where $\mu(t)$ is a global modulator which is assumed to take into account
all the global factors affecting the cell, e.g., the beginning or end of the critical
period, state of arousal, etc.

Rewriting the modification equation as $\dot{m} = \mu(t)(x \cdot m)(x \cdot m - \frac{4}{3}\Theta_m)x$, we see
that unlike a classical Hebb-Stent rule, the threshold $\Theta_m$ is dynamic. This gives
the modification equation the desired stability, with no extra conditions such as
saturation of the activity, or normalization of $\|m\|$, and also yields a statistically
meaningful optimization.
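
A discrete-time version of the modification equation can be sketched as a single Euler step. Several choices here are assumptions rather than part of the theory as stated: $\phi(c, \Theta_m) = c^2 - \frac{4}{3} c \Theta_m$ is the form for which the gradient of the risk (4.3) reproduces (4.5), the modulator $\mu(t)$ is held constant, and $\Theta_m$ is tracked by an exponential running average of $c^2$ (with a hypothetical rate `tau`) instead of the exact expectation.

```python
from typing import List, Sequence, Tuple


def phi(c: float, theta: float) -> float:
    """Modification function, assumed to be phi(c, theta) = c^2 - (4/3) * c * theta."""
    return c * c - (4.0 / 3.0) * c * theta


def bcm_step(m: Sequence[float], x: Sequence[float], theta: float,
             mu: float = 0.01, tau: float = 0.1) -> Tuple[List[float], float]:
    """One Euler step of dm/dt = mu * phi(x.m, theta) * x.

    theta approximates E[(x.m)^2] by an exponential running average; this
    estimator is an assumption made for the sketch.
    Returns the updated weight vector and the updated threshold.
    """
    c = sum(mi * xi for mi, xi in zip(m, x))      # linear activity c = x . m
    theta = (1.0 - tau) * theta + tau * c * c     # update the sliding threshold
    new_m = [mi + mu * phi(c, theta) * xi for mi, xi in zip(m, x)]
    return new_m, theta


# Example: one update for a two-dimensional input.
m_new, theta_new = bcm_step([1.0, 0.0], [2.0, 0.0], theta=0.0, mu=0.1, tau=0.5)
```

Under this assumed form the update changes sign at $c = \frac{4}{3}\Theta_m$, so the threshold that separates weight growth from weight decay moves with the recent activity of the cell.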

Returning to the statistical formulation, we let $\theta = m$ be the parameter to be
estimated  according  to  the  above  formulation  and  define  an  appropriate  loss  function 
depending  on  the  cell’s  decision  whether  to  fire  or  not.  The  loss  function  represents 
the  intuitive  idea  that  the  neuron  will  fire  when  its  activity  is  greater  than  some 
threshold,  and  will  not  otherwise.  We  denote  the  firing  of  the  neuron  by  a  =  1. 
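Define $K = -\mu \int_0^{\Theta_m} \hat\phi(s, \Theta_m)\,ds$. Consider the following loss function:

$$L_\theta(x, a) = L_m(x, a) = \begin{cases}
-\mu \int_0^{(x \cdot m)} \hat\phi(s, \Theta_m)\,ds, & (x \cdot m) \ge \Theta_m,\ a = 1 \\
K - \mu \int_0^{(x \cdot m)} \hat\phi(s, \Theta_m)\,ds, & (x \cdot m) < \Theta_m,\ a = 1 \\
-\mu \int_0^{(x \cdot m)} \hat\phi(s, \Theta_m)\,ds, & (x \cdot m) \le \Theta_m,\ a = 0 \\
K - \mu \int_0^{(x \cdot m)} \hat\phi(s, \Theta_m)\,ds, & (x \cdot m) > \Theta_m,\ a = 0
\end{cases} \qquad (4.1)$$
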


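It follows from the definition of $L_\theta$ and from the definition of $\delta_\theta$ in (2.2) that

$$L_m(x, \delta_m) = -\mu \int_0^{(x \cdot m)} \hat\phi(s, \Theta_m)\,ds = -\frac{\mu}{3}\{(x \cdot m)^3 - E[(x \cdot m)^2](x \cdot m)^2\}. \qquad (4.2)$$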


The above definition of the loss function suggests that the decision of a neuron
whether to fire or not is based on a dynamic threshold $(x \cdot m) \ge \Theta_m$. It turns out
that the synaptic modification equations remain the same if the decision is based
on a fixed threshold. This is demonstrated by the following loss function, which
leads to the same risk as in equation (4.3): $K = -\mu \int_0^{\frac{2}{3}\Theta_m} \hat\phi(s, \Theta_m)\,ds$,

$$L_\theta(x, a) = L_m(x, a) = \begin{cases}
-\mu \int_0^{(x \cdot m)} \hat\phi(s, \Theta_m)\,ds, & (x \cdot m) \ge 0,\ a = 1 \\
K - \mu \int_0^{(x \cdot m)} \hat\phi(s, \Theta_m)\,ds, & (x \cdot m) < 0,\ a = 1 \\
-\mu \int_0^{(x \cdot m)} \hat\phi(s, \Theta_m)\,ds, & (x \cdot m) \le 0,\ a = 0 \\
K - \mu \int_0^{(x \cdot m)} \hat\phi(s, \Theta_m)\,ds, & (x \cdot m) > 0,\ a = 0
\end{cases} \qquad (4.1')$$


The risk is given by:

$$R_m(\delta_m) = -\frac{\mu}{3}\{E[(x \cdot m)^3] - E^2[(x \cdot m)^2]\}. \qquad (4.3)$$
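
On a finite sample, the risk (4.3) can be evaluated as a projection index by replacing the expectations with sample means. The sketch below is illustrative only: the function name, the toy data, and the choice $\mu = 1$ are assumptions.

```python
from typing import Sequence


def projection_index(data: Sequence[Sequence[float]], m: Sequence[float],
                     mu: float = 1.0) -> float:
    """Empirical form of (4.3): -(mu/3) * (mean(c^3) - mean(c^2)^2), with c = x . m."""
    cs = [sum(mi * xi for mi, xi in zip(m, x)) for x in data]
    n = len(cs)
    third = sum(c ** 3 for c in cs) / n     # sample estimate of E[(x.m)^3]
    second = sum(c * c for c in cs) / n     # sample estimate of E[(x.m)^2]
    return -(mu / 3.0) * (third - second * second)


# Toy data: the first coordinate separates one input from the rest,
# the second coordinate is constant.
data = [(0.0, 1.0), (0.0, 1.0), (0.0, 1.0), (2.0, 1.0)]
index_separating = projection_index(data, (1.0, 0.0))   # equals -1/3
index_constant = projection_index(data, (0.0, 1.0))     # equals 0
```

The direction that separates the inputs receives the lower risk, so it is the one preferred when the index is minimized.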

The following graph represents the $\phi$ function and the associated loss function
$L_m(x, \delta_m)$ of the activity $c$.


Fig. 1: The Function $\phi$ and the Loss Functions for a Fixed $m$ and $\Theta_m$.

From the graph of the loss function it follows that for any fixed $m$ and $\Theta_m$, the loss
is small for a given input $x$, when either $x \cdot m$ is close to zero or negative, or when
$x \cdot m$ is larger than $\Theta_m$. This suggests that the preferred directions for a fixed $\Theta_m$
will be such that the projected single dimensional distribution differs from normal
in the center of the distribution, in the sense that it has a multi-modal distribution
with a distance between the two peaks larger than $\Theta_m$. Rewriting (4.3) we get

$$\frac{R_m(\delta_m)}{E^2[(x \cdot m)^2]} = -\frac{\mu}{3}\left\{\frac{E[(x \cdot m)^3]}{E^2[(x \cdot m)^2]} - 1\right\}. \qquad (4.4)$$

The term $E[(x \cdot m)^3]/E^2[(x \cdot m)^2]$ can be viewed as some measure of the skewness of
the distribution, which is a measure of deviation from normality and therefore an
interesting direction (Diaconis and Friedman, 1984), in accordance with Friedman
(1987) and Hall's (1988, 1989) argument that it is best to seek projections that
differ from the normal in the center of the distribution rather than in the tails.

Since the risk is continuously differentiable, its minimization can be done via the
gradient descent method with respect to $m$, namely:

$$\frac{dm_i}{dt} = -\frac{\partial}{\partial m_i} R_m(\delta_m) = \mu E[\phi(x \cdot m, \Theta_m) x_i]. \qquad (4.5)$$

Notice that the resulting equation represents an averaged deterministic equation
of the stochastic BCM modification equations. It turns out that under suitable
conditions on the mixing of the input $x$ and the global function $\mu$, equation (4.5) is
a good approximation of its stochastic version.
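
The averaged equation (4.5) can be sketched as batch gradient descent on the empirical risk. The sketch assumes $\phi(c, \Theta_m) = c^2 - \frac{4}{3} c \Theta_m$, sample means in place of expectations, a constant $\mu$, and a hypothetical step size `lr`; the function names are illustrative.

```python
from typing import List, Sequence

Vector = Sequence[float]


def activities(data: Sequence[Vector], m: Vector) -> List[float]:
    """Linear activities c = x . m for every input in the sample."""
    return [sum(mi * xi for mi, xi in zip(m, x)) for x in data]


def risk(data: Sequence[Vector], m: Vector, mu: float = 1.0) -> float:
    """Empirical risk (4.3): -(mu/3) * (mean(c^3) - mean(c^2)^2)."""
    cs = activities(data, m)
    n = len(cs)
    third = sum(c ** 3 for c in cs) / n
    second = sum(c * c for c in cs) / n
    return -(mu / 3.0) * (third - second * second)


def neg_gradient(data: Sequence[Vector], m: Vector, mu: float = 1.0) -> List[float]:
    """Right-hand side of (4.5): mu * mean(phi(c, theta) * x_i), theta = mean(c^2)."""
    cs = activities(data, m)
    n = len(cs)
    theta = sum(c * c for c in cs) / n
    phis = [c * c - (4.0 / 3.0) * c * theta for c in cs]
    return [mu * sum(p * x[i] for p, x in zip(phis, data)) / n
            for i in range(len(m))]


def descend(data: Sequence[Vector], m: Vector, steps: int = 100,
            lr: float = 0.01, mu: float = 1.0) -> List[float]:
    """Euler integration of dm/dt = -dR/dm with a fixed step size."""
    m = list(m)
    for _ in range(steps):
        g = neg_gradient(data, m, mu)
        m = [mi + lr * gi for mi, gi in zip(m, g)]
    return m
```

A finite-difference comparison of `neg_gradient` against `risk` is a convenient check that the assumed form of $\phi$ agrees with (4.3).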

When the nonlinearity of the neuron is emphasized, the neuron's activity is then
defined as $c = \sigma(x \cdot m)$, where $\sigma$ usually represents a smooth sigmoidal function.
$\Theta_m$ is then defined as $E[\sigma^2(x \cdot m)]$, and the loss function is similar to the one
given by equation (4.1) except that $(x \cdot m)$ is replaced by $\sigma(x \cdot m)$. The gradient of
the risk is given by: $-\nabla_m R_m(\delta_m) = \mu E[\phi(\sigma(x \cdot m), \Theta_m)\sigma' x]$, where $\sigma'$ represents
the derivative of $\sigma$ at the point $(x \cdot m)$. Note that $\sigma$ may represent any nonlinear
function, e.g. radial symmetric kernels.
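
The gradient for the nonlinear neuron follows from the chain rule. The steps below are a sketch that assumes the risk keeps the form of (4.3) with $\sigma(x \cdot m)$ in place of $x \cdot m$, and that $\phi(c, \Theta_m) = c^2 - \frac{4}{3} c \Theta_m$; here $\sigma$ and $\sigma'$ are evaluated at $x \cdot m$.

```latex
\begin{aligned}
R_m(\delta_m) &= -\frac{\mu}{3}\left\{E[\sigma^3] - E^2[\sigma^2]\right\}, \\
-\nabla_m R_m(\delta_m)
  &= \frac{\mu}{3}\left\{3E[\sigma^2 \sigma' x] - 2E[\sigma^2] \cdot 2E[\sigma \sigma' x]\right\} \\
  &= \mu E\left[\left(\sigma^2 - \tfrac{4}{3}\,\sigma\,\Theta_m\right)\sigma' x\right]
   = \mu E\left[\phi(\sigma(x \cdot m), \Theta_m)\,\sigma' x\right].
\end{aligned}
```

Setting $\sigma$ to the identity gives $\sigma' = 1$ and recovers (4.5).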

4.2  THE  NETWORK  -  MULTIPLE  FEATURE  EXTRACTION 

In this case we have $Q$ identical nodes, which receive the same input and inhibit
each other. Let the neuronal activity be denoted by $c_k = x \cdot m_k$. We define the
inhibited activity $\tilde{c}_k = c_k - \eta \sum_{j \ne k} c_j$ and the threshold $\tilde\Theta_m^k = E[\tilde{c}_k^2]$. In a more
general case, the inhibition may be defined to take into account the spatial location
of adjacent neurons, namely, $\tilde{c}_k = \sum_j \lambda_{jk} c_j$, where $\lambda_{jk}$ represents different types
of inhibitions, e.g. Mexican hat. Since the following calculations are valid for both
kinds of inhibition we shall introduce only the simpler one.

The loss function is similar to the one defined in a single feature extraction with the
exception that the activity $c = x \cdot m$ is replaced by $\tilde{c}$. Therefore the risk for node $k$ is
given by: $R_k = -\frac{\mu}{3}\{E[\tilde{c}_k^3] - (E[\tilde{c}_k^2])^2\}$, and the total risk is given by $R = \sum_{k=1}^{Q} R_k$.
The gradient of $R$ is given by:

$$\frac{\partial R}{\partial m_k} = -\mu[1 - \eta(Q - 1)]E[\phi(\tilde{c}_k, \tilde\Theta_m^k)x]. \qquad (4.6)$$

Equation (4.6) demonstrates the ability of the network to perform exploratory projection
pursuit in parallel, since the minimization of the risk involves minimization
of nodes $1, \ldots, Q$, which are loosely coupled.

The parameter $\eta$ represents the amount of lateral inhibition in the network, and
is related to the amount of correlation between the different features sought by
the network. Experience shows that when $\eta \approx 0$, the different units may all become
selective to the simplest feature that can be extracted from the data. When
$\eta(Q - 1) \approx 1$, the network becomes selective to those inputs that are very far apart
(under the $l^2$ norm), yielding a classification of a small portion of the data, and
mostly unresponsiveness to the rest of the data. When $0 < \eta(Q - 1) < 1$, the network
becomes responsive to substructures that may be common to several different
inputs, namely, it extracts invariant features in the data. The optimal value of $\eta$ has
been estimated by data driven techniques.
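
One batch update of the linear network can be sketched directly from the form of (4.6). The sketch assumes the uniform inhibition $\tilde{c}_k = c_k - \eta \sum_{j \ne k} c_j$, the form $\phi(c, \Theta) = c^2 - \frac{4}{3} c \Theta$, sample means in place of expectations, and a hypothetical step size `lr`; the function names are illustrative.

```python
from typing import List, Sequence

Vector = Sequence[float]


def phi(c: float, theta: float) -> float:
    """Assumed modification function phi(c, theta) = c^2 - (4/3) * c * theta."""
    return c * c - (4.0 / 3.0) * c * theta


def inhibited_activities(x: Vector, weights: Sequence[Vector], eta: float) -> List[float]:
    """Inhibited activity of each node: c_k - eta * sum_{j != k} c_j, with c_k = x . m_k."""
    cs = [sum(mi * xi for mi, xi in zip(m, x)) for m in weights]
    total = sum(cs)
    return [c - eta * (total - c) for c in cs]


def network_step(data: Sequence[Vector], weights: Sequence[Vector], eta: float,
                 mu: float = 1.0, lr: float = 0.01) -> List[List[float]]:
    """One gradient step for all Q nodes at once, following the form of (4.6)."""
    q, n = len(weights), len(data)
    acts = [inhibited_activities(x, weights, eta) for x in data]   # n rows, Q columns
    scale = mu * (1.0 - eta * (q - 1))                             # factor in (4.6)
    new_weights = []
    for k, m in enumerate(weights):
        theta_k = sum(a[k] ** 2 for a in acts) / n                 # threshold of node k
        grad = [scale * sum(phi(a[k], theta_k) * x[i] for a, x in zip(acts, data)) / n
                for i in range(len(m))]
        new_weights.append([mi + lr * gi for mi, gi in zip(m, grad)])
    return new_weights
```

With `eta = 0` every node follows the single-neuron rule (4.5) independently, which is the regime in which all units may settle on the same feature.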

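When the nonlinearity of the neuron is emphasized, the activity is defined (as in
the single neuron case) as $c_k = \sigma(x \cdot m_k)$. $\tilde{c}_k$, $\tilde\Theta_m^k$, and $R_k$ are defined as before. In
this case $\frac{\partial \tilde{c}_k}{\partial m_j} = -\eta \sigma'(x \cdot m_j)x$ for $j \ne k$, $\frac{\partial \tilde{c}_k}{\partial m_k} = \sigma'(x \cdot m_k)x$, and equation (4.6) becomes:

$$\frac{\partial R}{\partial m_k} = -\mu E\Big[\phi(\tilde{c}_k, \tilde\Theta_m^k)\Big(\sigma'(x \cdot m_k) - \eta \sum_{j \ne k} \sigma'(x \cdot m_j)\Big)x\Big]. \qquad (4.7)$$
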


4.3  OPTIMAL  NETWORK  SIZE 

A major problem in network solutions to real world problems is optimal network
size. In our case, it is desirable on the one hand to extract as many features as possible,
but it is clear that too many neurons in the network will simply inhibit
each other, yielding sub-optimal results. The following solution was adopted: We
replace  each  neuron  in  the  network  with  a  group  of  neurons  which  all  receive  the 
same  input,  and  the  same  inhibition  from  adjacent  groups.  These  neurons  differ 
from  one  another  only  in  their  initial  synaptic  weights.  The  output  of  each  neuron 
is  replaced  by  the  average  group  activity.  Experiments  show  that  the  resulting 
network  is  more  robust  to  noise  and  outliers  in  the  data.  Furthermore,  it  is  observed 
that groups that become selective to a true feature in the data possess a much
smaller inter-group variance of their synaptic weight vector than those which do
not become responsive to a coherent feature. We found that eliminating neurons
with large inter-group variance and retraining the network may yield improved
feature extraction properties.
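
The pruning step can be sketched as follows. The sketch reads the variance criterion as the spread of the synaptic weight vectors among the neurons of one group, which is an interpretation; the threshold `max_variance` and both function names are hypothetical, and the retraining that follows the pruning is not shown.

```python
from typing import List, Sequence

Group = Sequence[Sequence[float]]   # weight vectors of the neurons in one group


def group_variance(group: Group) -> float:
    """Mean squared distance of the group's weight vectors from their mean vector."""
    n, dim = len(group), len(group[0])
    mean = [sum(w[i] for w in group) / n for i in range(dim)]
    return sum(sum((w[i] - mean[i]) ** 2 for i in range(dim)) for w in group) / n


def prune_groups(groups: Sequence[Group], max_variance: float) -> List[Group]:
    """Keep only the groups whose neurons converged to similar weight vectors."""
    return [g for g in groups if group_variance(g) <= max_variance]


# Two groups of two neurons each: the first agrees, the second does not.
groups = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[1.0, 0.0], [-1.0, 0.0]],
]
kept = prune_groups(groups, max_variance=0.5)
```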

The  network  has  been  applied  to  speech  segments,  in  an  attempt  to  extract  some 
features  from  CV  pairs  of  isolated  phonemes  (Seebach  and  Intrator,  1988). 

5  DISCUSSION 

The PP method based on the BCM modification function has been found capable of
effectively discovering nonlinear data structures in high dimensional spaces. Using
a parallel processor and the presented network topology, the pursuit can be done
faster than in the traditional serial methods.

The projection index is based on polynomial moments, and is therefore computationally
attractive. When only the nonlinear structure in the data is of interest, a
sphering transformation (Huber, 1981, Friedman, 1987) can be applied first to the
data for removal of all the location, scale, and correlational structure from the data.
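
A sphering step can be sketched with the standard library alone. Note the substitution: this sketch whitens with the Cholesky factor of the sample covariance rather than with the eigendecomposition more commonly used for sphering; both give data with zero mean and identity covariance. It assumes the sample covariance is positive definite, and the function name is illustrative.

```python
import math
from typing import List, Sequence


def sphere(data: Sequence[Sequence[float]]) -> List[List[float]]:
    """Center the data and whiten it so that the sample covariance is the identity."""
    n, d = len(data), len(data[0])
    mean = [sum(x[i] for x in data) / n for i in range(d)]
    centered = [[x[i] - mean[i] for i in range(d)] for x in data]
    cov = [[sum(r[i] * r[j] for r in centered) / n for j in range(d)]
           for i in range(d)]

    # Cholesky factorization cov = L L^T (requires a positive definite covariance).
    low = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = cov[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]

    # Solve L z = r by forward substitution for every centered point r.
    sphered = []
    for r in centered:
        z: List[float] = []
        for i in range(d):
            z.append((r[i] - sum(low[i][k] * z[k] for k in range(i))) / low[i][i])
        sphered.append(z)
    return sphered


points = [[2.0, 0.0], [0.0, 1.0], [-2.0, 0.0], [0.0, -1.0], [1.0, 1.0], [-1.0, -1.0]]
sphered_points = sphere(points)
```

After this step the second-order structure is gone, so any direction the projection index still prefers reflects higher-order structure in the data.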

When compared with other PP methods, the highlights of the presented method are:
i) the projection index concentrates on directions where the separability property as
well as the non-normality of the data is large, thus giving rise to better classification
properties; ii) the degree of correlation between the directions, or features extracted
by the network, can be regulated via the global inhibition, allowing some tuning of
the network to different types of data for optimal results; iii) the pursuit is done on
all the directions at once, thus leading to the capability of finding more interesting
structures than methods that find only one projection direction at a time; iv) the
network's structure suggests a simple method for size-optimization.